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The proposed paper presents a variety novel uses of Space-Filling-Curves (SFCs) for Cartesian 
mesh methods in CFD. While these techniques will be demonstrated using non-body-fitted Cartesian 
meshes, most are applicable on general body-fitted meshes - both structured and unstructured. We 
demonstrate the use of single 6(N log N) SFC-based reordering to produce single-pass (0(N)) algo- 
rithms for mesh partitioning, multigrid coarsening, and inter-mesh interpolation. The intermesh 
interpolation operator has many practical applications including “warm starts” on modified geome- 
try, or as an inter-grid transfer operator on remeshed regions in moving-body simulations. Exploiting 
the compact construction of these operators, we further show that these algorithms are highly amena- 
ble to parallelization. Examples using the SFC-based mesh partitioner show nearly linear speedup to 
512 CPUs even when using multigrid as a smoother. Partition statistics are presented showing that the 
SFC partitions are, on-average, within 10% of ideal even with only around 50,000 cells in each sub- 
domain. The inter-mesh interpolation operator also has linear asymptotic complexity and can be used 
to map a solution with N unknowns to another mesh with M unknowns with 6<max(M^V)) operations. 
This capability is demonstrated both on moving-body simulations and in mapping solutions to per- 
turbed meshes for finite-difference-based gradient design methods. 


1 Introduction 

T HE literature on numerical solution methods for PDE’s 
contains a wealth of diverse approaches for improving 
cache performance, performing domain decomposition, mesh 
coarsening and inter-mesh solution transfer. These topics are 
frequently considered separately and usually require different 
data structures and infrastructure to perform each task. For 
example, a high-performance distributed memory unstruc- 
tured mesh code may use RCM ordering for improving 
cache-performance, recursive spectral bisection, or multi- 
level nested dissection for domain decomposition^ 1 ^, a 
graph-based mesh coarser^ 33 , and a variety of fast spatial 
search data structures for inter-mesh interpolation, or solu- 
tion transfer to re-gridded regions of a subdomain. 

For each task, these techniques offer superb performance, 
and many of them are amenable to some degree of parallel- 
ization. Nevertheless, development and maintenance of such 
a diverse collection of tools requires considerable investment, 
and when a substantial overhaul is necessary (e.g. moving a 
code to distributed-memory parallelization) the level of effort 
required will be significant. 

The proposed paper examines the use of space-filling-curves 
as a unifying data structure for all of these ends. Construction 
of these curves is extremely inexpensive as the SFC index for 
any voxel in space may be computed using only local infor- 
mation, and thus computation of indices may be performed in 
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parallel. The asymptotic time complexity of constructing the 
curve is therefore bounded by the sort algorithm which 
orders the mesh along the curve. This sort can be performed 
with standard sorting routines such as a quicksort which pro- 
duces a method with 6(N log N) running time. 

2 Space-Filling-Curves 

The central operation in using space-filling curves (SFC) for 
our purposes is a reordering of the mesh using either the 
Morton or Peano-Hilbert order Both orderings have been 
explored in scientific computing in a variety of roles, includ- 
ing the parallel solution of A-body problems in computa- 
tional physics t 23 !, algebraic multigrid^, mesh 

generation,^ in the solution of PDEs and dynamic reparti- 
tioning of adaptive methods^. Figure 1 shows both Peano- 
Hilbert and Morton SFCs constructed on Cartesian meshes at 
three levels of refinement. In two dimensions, the basic build- 
ing block of the Hilbert curves is a “U” shaped line which 
visits each of 4 cells in a 2 x 2 block. Each subsequent level 
divides the previous level’s cells by nested dissection, creat- 
ing subquadrants which are, themselves, visited by U shaped 
curves as well. Similar properties exist for the Morton order- 
ing which uses an “NT shaped curve as its basic building 
block. 

In three spatial dimensions the curves follow the same basic 
construction rules, but the basic building block extends into 
the third dimension with additional U or N-shaped turns. Fig- 
ure 2 illustrates this 3-D construction for the U-order show- 
ing: (a) the basic 2x2x2 building block (b) the same mesh 
after one uniform refinement, each segment of the curve is 
replaced with an appropriately rotated basic building block, 
(c) the curve after additional adaptive refinement of cells near 
the back-south-west comer. Properties and <i-dimensional 
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a) 


b) 


Figure 1: Space-filling curves used to order three Cartesian 
meshes in two spatial dimensions: a) Peano-Hilbert or “U- 
ordering”, b) Morton or “N-ordering”. 

construction rules for these space-filling curves are discussed 
extensively in refs. [27] and [28]. 

Both orderings have locality properties which make them 
attractive as mesh parti tioners^ 26 ^ 25 ^. For the present, we 
note only that such orderings have 3 important properties. 





Figure 2: U-order in three dimensions for (a) a basic 2x2x2 
block of cells, (b) the same block after uniform subdivision 
(c) cell order after refinement near the south-west-back cor- 


1. Mapping M : ££ -* U: The U and N orderings each 
provide unique mappings from the d-dimensional 
physical space of the problem domain 9 d to a one- 
dimensional hyperspace, V , which one traverses fol- 
lowing the curve. In the U-order, two of a cell’s face 
neighbors in physical space will remain adjacent to 
the cell in the one-dimensional hyperspace. 

2. Locality: In the U-order, each cell visited by the 
curve is directly connected to two face-neighboring 
cells which remain face-neighbors in the one dimen- 
sional hyperspace spanned by the curve. Locality in 
N-ordered domains is almost as good^ 22 l 

3. Compactness: Encoding and decoding the Hilbert or 
Morton order requires only local information. Follow- 
ing the integer indexing for Cartesian meshes outlined 
in ref. [5], a cell’s 1-D index in U may be constructed 
using only that cell’s integer coordinates in 9? and the 
maximum number of refinements that exist in the 
mesh. This aspect is in marked contrast to other parti- 
tioning schemes based on recursive spectral bisection 
or other multilevel decomposition approaches which 
require the entire connectivity matrix of the mesh in 
order to perform the partitioning. 

To illustrate the property of compactness , consider the posi- 
tion of a cell i in the N-order. One way to construct this map- 
ping would be from a global operation such as a recursive 
lexicographic ordering of all cells in the domain. Such a con- 
struction would not satisfy the property of compactness. 
Instead, the index of i in the N-order may be deduced solely 
by inspection of cell Vs integer coordinates (x it y iy z,-). 

Assume (jc<, y/, zi) is the bitwise representation of the integer 
coordinates (x y it z,) using m-bit integers. The bit sequence 
{x! y\ zl } denotes a 3-bit integer constructed by interleaving 
the first bit of j t ( -, y t and z,*. One can then immediately com- 
pute cell Vs location in U as the 3/n-bit integer 
{xiyiZixtytzt ...xTyt zT} . Thus, simply by inspection of a 
cell’s integer coordinates, we are able to directly calculate its 
location, M(i), in the one-dimensional space U without any 



Figure 3: Morton order of an adaptively refined Cartesian mesh 
around a 2-D airfoil. 
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additional information. Similarly compact construction rules 
exist for the U-order^. 

In an /i-refined Cartesian mesh, the finest cell can be used to 
define the dimensions of a single voxel. All coarser cells may 
then be reinterpreted as collections of this basic building 
block. This interpretation leads to an integer indexing scheme 
which can be used to address all cells in the computational 
domain. With this integer indexing scheme, the construction 
rules in the previous paragraph can then be applied to gener- 
ate the Morton index, M(i), of all the cells in the mesh. Figure 
3 shows an example of the N-ordering on an adaptively 
refined mesh around a 2-D airfoil. 

Construction of the Peano-Hilbert index, H(i), follows a sim- 
ilarly compact procedure. After computing the SFC indices 
M(i) or H(i) for all the cells in the mesh, one simply takes 
these indices as sort keys and applies any one of the standard 
sorting algorithms (we use the quickersort algorithm from the 
C standard library). Since all the other data required for con- 
struction is local, the sort establishes the asymptotic bound 
for runtime of the reordering and choosing quick/quicker sort 
gives runtimes of 6{N log N). 

3 Mesh Coarsening 

The recursive nature of the SFC ordering makes it a natural 
choice for a mesh coarsener for /i-refined meshes. In the 
same way that higher-order SFCs are generated by replacing 
segments with the basic U or N building block, the children 
produced by A-refinement replace their parent cell in the 
mesh. Since these children are sorted according to the local 
SFC they will all be nearest-neighbors on a contiguous seg- 
ment of the SFC in the space of the curve U . 

Figure 4 illustrates this observation in two-dimensions. Fig- 
ure 4a shows the baseline multilevel mesh as it comes out of 
either the mesh generator or adaptation module. Frames (b) 
and (c) show the same mesh after two successive coarsen- 
ings. Cells in these meshes are ordered using the N-ordering, 
and their indices in the SFC order are shown. In fig. 4a cells 
22, 23, 24, and 25 all lie adjacent to one another in the 1- 
dimensional hyperspace of the curve, U. 

The coarsening algorithm proceeds with a single traversal of 
U with a running index i and testing if the cell at U(i) is con- 
tained within the same parent as the cell at U(i +1). When a 
contiguous group of cells are found that all coarsen into the 
same parent cell they are agglomerated into that parent. If 
any of the siblings in a contiguous set are further subdivided, 
then all siblings of the set are “not coarsenable” and get 
injected into the coarse mesh without modification. 

The first coarsening of the mesh in fig. 4a produces the mesh 
in fig. 4b. Cells 22-25 are coarsened in this first pass to pro- 
duce cell 10 on the mesh in frame (b). Note that while cells 
20, 21 and 26 are all children of the same parent, they are not 
coarsenable since 22-25 are one-level finer. The next coarsen- 
ing pass produces the mesh in frame (c). Since cells 8-9 on 
mesh (b) were all siblings, they now coarsen to produce cell 5 
on the mesh shown in frame (c). Further details of this algo- 
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Figure 4: Mesh coarsening example using Morton ordering. The 
fine mesh in (a) is coarsened 2 times using the same space-fill- 
ing-curve. 

rithm are available in Ref. [30] with details of the treatment 
of split-cells described in [31]. 
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Figure 5: Mesh coarsening example in 3D using agglomeration 
along the SFC. The finest mesh contains 4.5 M cells while the 
coarsest contains 4500 cells after 4 coarsenings. 

Figure 5 shows an example in three dimensions. In this 
example, a 4.5 M cell, adaptively refined mesh around two 
surface ships has undergone 4 cycles of coarsening by 
agglomeration along the SFC. The final mesh shown in the 
lower right frame of the figure contains 4500 cells. In prac- 
tice, typical coarsening ratios for the algorithm are in excess 
of 7 on realistically complex problems. 

4 Domain decomposition 

The mapping and locality properties that are exploited for the 
single-pass mesh coarsener described above make SFCs a 
natural choice for partitioners on hierarchical 
meshes J 24 ^ 25 ^ 30 ^ 31 ! Figure 6 illustrates these mapping and 
locality properties for an adapted two-dimensional Cartesian 
mesh, partitioned into three subdomains. The figure points 
out that for adapted Cartesian meshes, the hyperspace U may 
not be fully populated by cells in the mesh. However, since 
cell indices in U may be explicitly formed, this poses no 
shortcoming. 



Figure 6: An adapted Cartesian mesh and associated space-filling 
curve based on the U-ordering of M : 01 — ► U with the U- 
ordering illustrating locality and mesh partitioning in two spa- 
tial dimensions. Partitions are indicated by the heavy dashed 
lines in the sketch. 


The quality of the partitioning resulting from U-ordered 
meshes have been examined in ref. [26] and were found to be 
competitive with respect to other popular partitioners. 
Weights can be assigned on a cell-by-cell basis. One advan- 
tage of using this partitioning strategy stems from the obser- 
vation that mesh refinement or coarsening simply increases 
or decreases the population of U while leaving the relative 
order of elements away from the adaptation unchanged. Re- 
mapping the new mesh into new subdomains therefore only 
moves data at partition boundaries and avoids global re map- 
pings when cells adaptively refine during mesh adaptation. 
Recent experience with a variety of global repartitioners sug- 
gest that the communication required to conduct this remap- 
ping can be an order of magnitude more expensive than the 
re partitioning itself^. Additionally, since the partitioning is 
basically just a re-ordering of the mesh cells into the U-order, 
the entire mesh may be stored as a single domain. At run- 
time, this single mesh may then be partitioned into any num- 
ber of subdomains on-the-fly as it is read into the flow solver 
from mass storage. In a heterogeneous shared computing 
environment where the number of available processors may 
not be known at the time of job submission, the value of such 
flexibility is self-evident. 

The SFC reordering pays additional dividends in cache-per- 
formance. The locality property of the SFC produces a con- 
nectivity matrix which is tightly clustered regardless of the 
number of subdomains. Our numerical experiments suggest 
that SFC reordered meshes have about the same data cache 
hit-rate as those reordered using reverse Cuthill-McKee. 

Figure 7 shows an example of a three dimensional Cartesian 
mesh around the full Space Shuttle Launch Vehicle (SSLV) 
configuration. This complex configuration includes the 
orbiter, external tank, solid rocket boosters, and fore and aft 
attach hardware. The computational mesh contains about 4.4 
M cells at 14 levels of refinement, and is indicated by a single 
cutting plane passed through the mesh just behind the SSLV 
geometry. The mesh is displayed partitioned into 16 subdo- 
mains using the U-order. Reordering this mesh with the algo- 
rithm of §2 required 40 sec. on a 2 Ghz Pentium 4, and 
preparing 4 levels of coarser meshes for multigrid required 
30 sec. on the same machine. In determining partition bound- 
aries in this particular example, cut-cells were weighted 2. lx 
as compared to un-cut Cartesian hexahedra. 

The partitioning in fig. 7 demonstrates how, even on adap- 
tively refined meshes, the partitioning tends to produce sub- 
domains that are largely rectilinear. On a uniform mesh, an 
appropriately chosen number of partitions would result in a 
cubical decomposition of the computational domain. This 
“best case” establishes the minimum ratio of communication 
to computation (surface/volume ratio) for SFC partitioned 
meshes. For any given number of cells, one can conceive of 
an idealized cubical partitioner which would communicate 
across some number of faces, F c given by: 





3 OF 6 




Extended Abstract for 42nd AIAA Aerospace Sciences Meeting and Exhibit 



Figure 7: 4.4 M cell mesh around full SSLV configuration 
including orbiter, external tank, solid rocket boosters, 
and fore and aft attach hardware. Mesh color indicates 
partitioning (16 partitions, U-ordering). 


where N ce u s is the total number of cells in the domain and N p 
is the number of subdomains. With this as a guide one can 
examine the quality of the mesh partitioning provided by the 
N and U-orderings. 


Figure 8 examines the average partition statistics for a 1.6M 
cell adaptively refined mesh decomposed into 64 subdo- 


eot>: 



PartilKXHog 

Figure 8: Partitioning statistics for a 1.6M cell adapted carte- 
sian mesh decomposed into 64 subdomains with two dif- 
ferent SFCs and the idealized cubical partitioner described 
in the text. 



Figure 9: Isobars in discrete solution for SSLV configuration at 
Mo, - 2.6, a = 2.09°, p = 0.8°. This case was used for seal- 
ability results. 


mains. Even with this large number of subdomains on a rela- 
tively small mesh, this figure shows that both SFC 
parti tioners perform well, with the Peano-Hilbert decomposi- 
tion showing only a 4% overhead, (in the final paper more 
exhaustive statistics will be gathered , showing how these 
results vary with number of partitions, and mesh size. We will 
also examine the average AND worst partitions to provide 
better insight into predicted runtimes). 

4.1 Scalability and Performance 

Scalability tests were conducted using the 4.4M cell mesh 
from fig. 7. Flow conditions for the test had the SSLV at 
Moo = 2.6, a = 2.09°, P = 0.8°. Isobars of the discrete solution 
are shown in fig. 9. The inviscid multilevel solver from [30] 
and [31] was used with the MPI based communication out- 
lined in [32] and now extended to support multigrid acceller- 
ation. 

Figure 10 shows scalability results for this case on an SGI 
Origin 3000 (600Mhz, Mips R 14000 CPUs) using multigrid 
with MPI. Performance is essentially linear across the experi- 
ment with parallel speedups of about 430 showing on 512 
epus. The measured performance starts to tail-off above 128 
CPUs due to the decrease in the ratio of computation to com- 
munication as the subdomains shrink. On 512 CPUs there are 
only about 8500 cells in each partition leading to an exces- 
sive amount of communication. On 128 CPUs the code is 
achieving parallel speedups of approximately 120, and there 
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Figure 10: Parallel scalability on 4.4 M cell SSLV test case on 
SGI Origin 3600 with 600Mhz Mips R14000 processors. 

are just under 35,000 cells per partition. (More detailed anal- 
ysis will be included in the final paper). 

A second example with this configuration was run on a much 
smaller distributed memory machine to assess performance 
on more commonplace hardware. Figure 1 1 shows scalability 
on a 16 CPU cluster of dual -processor Pentium 4 processors 
each clocked at 2.4Ghz. The eight boxes are connected by a 
dedicated gigabit LAN through a eight port switch. This 
example used a smaller test case with only 1.8 M cells in the 
mesh (same SSLV configuration). On this architecture, per- 
formance remains linear, and parallel speedups between 12 
and 13 are achieved on 15 CPUs, (further analysis planned 
for final paper) 

5 Inter-mesh interpolation 

The final paper will include a description of a completely 
new linear time 3D interpolation algorithm including runt- 
imes. This algorithm permits solution transfer between 
remeshed solution domains even when the geometry and 


t JVteatra 4 . Ethernet. IXM >Tcti » 



Figure 11: Parallel scalability on 1.6 M cell SSLV test case on 
dual-CPU PC cluster with gigabit interconnect. 


mesh are radically changed. The space filing curve permits 
this 3D interpolation problem to be solved in seconds on a 
PC. Applications will include: 

• demonstration of algorithm as interpolation operator when 
geometry is perturbed . e.g. control surface deflections & 
design applications 

• demonstration of use as an interpolation operator for 
remeshed solution domains in time-dependent moving-body 
simulations. 
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This work is an extended abstract to be submitted to the 42nd 
Aerospace sciences meeting and exhibit session on applied CFD. The 
conference will be in Reno NV Jan 6-11, 2004. 

The proposed paper presents a variety novel uses of Space-Filling- 
Curves (SFCs) for Cartesian mesh methods in CFD. While these 
techniques will be demonstrated using non-body-fitted Cartesian 
meshes, most are applicable on general body-fitted meshes - both 
structured and unstructured. We demonstrate the use of single 0(N log 
N) SFC-based reordering to produce single-pass (0(N)) algorithms for 
mesh partitioning, multigrid coarsening, and inter-mesh interpolation . 
The intermesh interpolation operator has many practical applications 
including “warm starts” on modified geometry, or as an inter-grid 
transfer operator on remeshed regions in moving-body simulations. 
Exploiting the compact construction of these operators, we further 
show that these algorithms are highly amenable to parallelization . 
Examples using the SFC-based mesh partitioner show nearly linear 
speedup to 512 CPUs even when using multigrid as a smoother. Partition 
statistics are presented showing that the SFC partitions are, on- 
average, within 10% of ideal even with only around 50,000 cells in 
each sub-domain. Results are also presented on distributed computing 
clusters. The inter-mesh interpolation operator also has linear 
asymptotic complexity and can be used to map a solution with N 
unknowns to another mesh with M unknowns with 0(max(M,N)) operations. 
This capability is demonstrated both on moving-body simulations and in 
mapping solutions to perturbed meshes for finite-difference-based 
gradient design methods. 



